Joe Clarke & Ahyoung Lim
From cumulative to incident case counts in the PAHO PLISA Health Information Platform for the Americas dengue database
It is important to note that the cumulative datasets are for reports of “all dengue cases: suspected, probable, confirmed, non-severe and severe cases, and deaths. Probable dengue cases: a person who has fever or a history of fever of 2-7 days' duration, two or more symptoms of dengue, and either one positive serological test or an epidemiological nexus with a confirmed dengue case in the 14 days before onset of symptoms”.
Although this dataset offers a “select epidemiological week” slider covering every available epidemiological week of every year, the week for which information is actually reported can differ from the one selected on the slider, and some countries have missing weeks of data. The extent of missing data varies greatly between countries and over time. Moving from cumulative to incident weekly counts is straightforward for countries that report a complete sequence of cumulative counts: each week's count is subtracted from the following week's. It is not straightforward for countries with missing data. Incident data are already available through the “Chart” and “Epi Curve” options, but only for non-severe dengue cases. This is not the same as the “total of dengue cases” measured in the cumulative table, which is our priority for this dataset.
We want to investigate whether this incident dataset can be leveraged to gap-fill the cumulative dataset when moving to incident counts, and how the two differ.
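As a toy illustration of the problem (the counts below are made up): with complete reporting, the incident count is simply the difference of consecutive cumulative counts, but a single missing week leaves two incident weeks unrecoverable.

```r
# hypothetical cumulative counts for epidemiological weeks 1-5
cum_complete <- c(10, 25, 40, 60, 90)
diff(c(0, cum_complete))   # incident counts: 10 15 15 20 30

# the same series with week 4 missing
cum_gappy <- c(10, 25, 40, NA, 90)
diff(c(0, cum_gappy))      # weeks 4 AND 5 become NA: 10 15 15 NA NA
```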
The PLISA PAHO portal only permits downloading the all-countries cumulative dataset one epidemiological week at a time, by moving the epidemiological week slider while all countries are selected. The downloaded files are encoded as UCS-2LE and must be re-saved with a standard encoding for further processing.
The files are loaded and a new directory is created for the processed data.
# packages used throughout this script
library(data.table); library(dplyr); library(tidyr); library(lubridate)
library(EpiWeek); library(zoo)
file_list <- list.files("raw_data/source_data"); file_list
old_dir <- "raw_data/source_data"
new_dir <- "raw_data/formatted_data"
dir.create(new_dir, showWarnings = FALSE)
# Re-encode each UCS-2LE source file and save a copy for further processing
for (j in seq_along(file_list)) {
  old_file <- paste0(old_dir, "/", file_list[j])
  new_file <- paste0(new_dir, "/", file_list[j])
  in_conn <- file(old_file, encoding = "UCS-2LE")
  out_conn <- file(new_file)
  writeLines(readLines(in_conn), out_conn)
  close(in_conn)
  close(out_conn)
}
csv_files <- list.files(new_dir, full.names = TRUE, recursive = FALSE, pattern = "\\.csv$")
temp <- lapply(csv_files, read.table, sep = "\t", colClasses = rep("character", 16), header = TRUE)
# combine the two batches of downloads, dropping incomplete trailing weeks/years
merge1 <- rbindlist(temp[1:53]) %>% filter(!(Year == "2022" & as.numeric(EW) > 35))
merge2 <- rbindlist(temp[54:76]) %>% filter(!Year == "2023")
merge <- rbind(merge1, merge2)
#Column rename and select
all_dt <- merge %>%
select(Country.or.Subregion, Serotype, Year, Confirmed, Deaths,
Epidemiological.Week..a., Severe.Dengue..d., Total.of.Dengue.Cases..b.) %>%
rename(adm_0_name = Country.or.Subregion,
year = Year,
epidemiological_week = Epidemiological.Week..a.,
severe_dengue_cumulative = Severe.Dengue..d.,
dengue_total_cumulative = Total.of.Dengue.Cases..b.) %>%
filter(!epidemiological_week %in% c("-"))
#convert epidemiological week and year to numeric
all_dt$epidemiological_week <- as.numeric(all_dt$epidemiological_week)
all_dt$year <- as.numeric(all_dt$year)
#remove duplicates
all_dt <- all_dt %>%
distinct()%>%
arrange(adm_0_name, year, epidemiological_week)
#add the end date of the cumulative count for each row ("cum_end_date").
for(i in 1:nrow(all_dt)){
all_dt$cum_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)
}
# add cum_start_date based on the cum_end_date of the previous row. cum_end_date is lagged (rather than simply subtracting 6 days) because the cumulative count may not be reported for consecutive weeks.
all_dt <- all_dt %>%
arrange(adm_0_name, year, epidemiological_week)%>%
group_by(adm_0_name, year)%>%
mutate(cum_start_date = ymd(lag(cum_end_date))+1)
#create "complete" dataset which has a row for every epidemiological week for every year, with NA for the dengue_total count if no data reported for this week. Make sure number of epidemiological weeks in each year is correct according to epidemiological calendars for each year (some have 52 weeks, others have 53).
all_dt <- all_dt %>% ungroup()%>%
tidyr::complete(adm_0_name, year, epidemiological_week)%>%
filter(!(year %in% c(2015:2019, 2021:2022) & epidemiological_week == 53))
all_dt %>%
group_by(adm_0_name) %>% tally()
#add cum_end_date again for rows that were missing from the original data but were filled in by the 'complete' function above. Add the start/end date of the epidemiological week for all rows ("calendar_start_date" and "calendar_end_date").
for (i in 1:nrow(all_dt)) {
  all_dt$calendar_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)
  # fill cum_end_date only for the rows added by complete() above
  if (is.na(all_dt$cum_end_date[i])) {
    all_dt$cum_end_date[i] <- all_dt$calendar_end_date[i]
  }
}
# add cum_start_date, calendar_start_date and source_id
all_dt <- all_dt %>%
mutate(cum_end_date = ymd(cum_end_date))%>%
mutate(cum_start_date = ifelse(is.na(cum_start_date), as.character(ymd(cum_end_date)-6), paste0(cum_start_date)))%>%
mutate(cum_interval = interval(cum_start_date, cum_end_date) %/% weeks(1)+1)%>%
mutate(calendar_end_date = ymd(calendar_end_date))%>%
mutate(calendar_start_date = ymd(calendar_end_date)-6)%>%
mutate(source_id= ifelse(is.na(dengue_total_cumulative)==FALSE, paste0("all_PAHO_countries_PAHO_2014_to_2022_EW",epidemiological_week), NA))
Each dataset is then processed into two data hierarchies, primary (A) and secondary (B), with calendar dates added. Epidemiological week and year are retained, as they are useful for later processing.
The case counts are left as cumulative counts at this stage, which may be useful for users who prefer to work with cumulative counts. Here, the calendar start date is fixed at the beginning of epidemiological week one of each year, and the calendar end date is moved forward to cover the cumulative count.
#select columns for master table_A and change to date format
all_dt_A <- all_dt %>%
select(adm_0_name, adm_0_code, year, epidemiological_week,
calendar_start_date, calendar_end_date,
cum_start_date, cum_end_date, cum_interval,
dengue_total_cumulative, source_id
)%>%
mutate(calendar_start_date = ymd(calendar_start_date),
calendar_end_date = ymd(calendar_end_date),
cum_start_date = ymd(cum_start_date),
cum_end_date = ymd(cum_end_date))
all_dt_A_tidy <- all_dt_A %>%
distinct() %>%
mutate_at(c(2), ~replace(., is.na(.), 0)) %>%
group_by(adm_0_name, year) %>%
arrange(adm_0_name, year, epidemiological_week)
#Check admin codes and fill gaps
admin_NA <- all_dt_A_tidy %>%
dplyr::filter(adm_0_code==0)
write.csv(all_dt_A_tidy,"processed_data/all_dt_A.csv", row.names = FALSE)
#Dataset B (secondary dataset), which contains information on disease severity, serotype, and mortality.
all_dt_B <- all_dt %>%
select(adm_0_name, adm_0_code,
calendar_start_date, calendar_end_date,
dengue_total_cumulative, Deaths, Serotype, severe_dengue_cumulative, source_id)%>%
rename(dengue_total=dengue_total_cumulative,
severe_dengue=severe_dengue_cumulative)%>%
mutate(calendar_start_date = ymd(calendar_start_date),
calendar_end_date = ymd(calendar_end_date))%>%
filter(!is.na(dengue_total) & dengue_total != "NA") # drop rows with no reported total
all_dt_B_tidy <- all_dt_B %>%
distinct() %>%
mutate_at(c(2), ~replace(., is.na(.), 0))%>%
group_by(adm_0_name) %>%
arrange(desc(calendar_start_date))
admin_NA <- all_dt_B_tidy %>%
dplyr::filter(adm_0_code==0)
#all_dt_B_tidy$adm_0_code[all_dt_B_tidy$adm_0_name=="Virgin Islands (UK)"]<-"39"
#all_dt_B_tidy$adm_0_code[all_dt_B_tidy$adm_0_name=="Virgin Islands (US)"]<-"258"
all_dt_B_tidy <- all_dt_B_tidy %>%
pivot_longer(c(severe_dengue, dengue_total, Deaths),
names_to="dengue_classification", values_to="total")
all_dt_B_tidy <- all_dt_B_tidy[!(all_dt_B_tidy$dengue_classification == ""), ]
all_dt_B_tidy <- all_dt_B_tidy %>%
select(adm_0_name, adm_0_code, calendar_start_date, calendar_end_date,
dengue_classification, total, Serotype, source_id)
write.csv(all_dt_B_tidy,"processed_data/all_dt_B.csv", row.names = FALSE)
Some cumulative counts are revised down over time, and we need to identify the rows in our dataset that contain such values. For example, if the cumulative count for week 5 was initially 100 but revised down to 90 in week 6, week 5 is tagged as containing a revised-down value. To do this, each week's cumulative count is compared, in a for loop, with all subsequent weeks of the same country and year (e.g., week 5 against weeks 6 to 53).
data$result <- FALSE
#split the dataset by country and year (16 countries * 9 years = 144 datasets) and save them into a list
dt_list <- data %>%
group_by(adm_0_name, year)%>%
group_split()
# iterate over each dataset
for (x in 1:length(dt_list)) {
  df <- dt_list[[x]]
  if (nrow(df) < 2) next
  # iterate over each row except the last
  for (i in 1:(nrow(df) - 1)) {
    # if the ith row is NA, skip to the next row
    if (is.na(df$dengue_total_cumulative[i])) next
    # iterate over each subsequent row
    for (j in (i + 1):nrow(df)) {
      # skip NA rows and compare with the next one instead
      if (is.na(df$dengue_total_cumulative[j])) next
      # if the ith count is larger than the jth (later) count, flag row i
      if (df$dengue_total_cumulative[i] > df$dengue_total_cumulative[j]) {
        dt_list[[x]]$result[i] <- TRUE
        break # one smaller later value is enough to flag row i
      }
    }
  }
  # the last row has no later rows to compare against, so its result stays FALSE
}
# merging datasets back
data2 <- rbindlist(dt_list)
# cumulative counts that have been revised down are replaced with NAs.
data2 <- data2 %>%
mutate(dengue_total_cumulative2 = ifelse(result ==TRUE, NA, dengue_total_cumulative))%>%
mutate(after_revision = ifelse(result == TRUE, "Revised_NA", original))
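For reference, the nested loop is equivalent to comparing each non-NA value against the minimum of all later non-NA values. A vectorized sketch, done in a single backward pass (the helper name `flag_revised` is ours, not part of the pipeline):

```r
# flag values that exceed any later non-NA value, i.e. were revised down
flag_revised <- function(x) {
  suffix_min <- rep(Inf, length(x))
  m <- Inf
  for (i in length(x):1) {
    suffix_min[i] <- m                  # min of values strictly after position i
    if (!is.na(x[i])) m <- min(m, x[i])
  }
  !is.na(x) & x > suffix_min
}
flag_revised(c(100, 90, NA, 95))  # TRUE FALSE FALSE FALSE
```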
Where the total sum of the weekly counts equals the annual count (assuming the annual count data are correct), NA values were identified and replaced with count values for the following three use cases.
source(here::here('function', 'leveraging_annual_data.R'))
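The sourced script is not shown in this document. As a loose, hypothetical sketch of the kind of lookup tables it appears to produce (the annual reference table `annual_dt` and its column names are assumptions, not part of this pipeline): v1 holds country-years whose annual total is zero, and v2 holds country-years whose latest reported cumulative count already equals the annual total.

```r
# v1: country-years with an annual total of zero -> all weekly NAs can become 0
v1 <- annual_dt %>%
  filter(annual_total == 0) %>%
  mutate(key = paste0(adm_0_name, "_", year))

# v2: country-years where the latest reported cumulative count equals the
# annual total -> weeks after it can be filled with that latest count
v2 <- data2 %>%
  group_by(adm_0_name, year) %>%
  summarise(latest_week  = max(epidemiological_week[!is.na(dengue_total_cumulative2)]),
            latest_count = dengue_total_cumulative2[epidemiological_week == latest_week],
            .groups = "drop") %>%
  mutate(key = paste0(adm_0_name, "_", year)) %>%
  inner_join(annual_dt, by = c("adm_0_name", "year")) %>%
  filter(latest_count == annual_total)
```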
zero1 <- data2 %>%
mutate(key = paste0(adm_0_name, "_", year))%>%
mutate(dengue_total_cumulative3 = ifelse(key %in% v1$key, 0, dengue_total_cumulative2))%>%
mutate(after_zero1 = after_revision)
zero1$after_zero1[is.na(zero1$after_revision) & zero1$dengue_total_cumulative3==0] = "Zero1"
plyr::count(zero1$after_zero1)
summary(data2$dengue_total_cumulative2) # NA=8978
summary(zero1$dengue_total_cumulative3) # NA=8679 (decreased from 8978)
# 2) is the latest data point = last week of the year?
# filling in 0s for the weeks after the latest cumulative count when compare==0
x <- zero1 %>%
mutate(key2 = paste0(adm_0_name, "_", year, "_", epidemiological_week))%>%
merge(., v2[c("key", "latest_week", "latest_count")], by=c("key"), all.x=T, all.y=F)%>%
filter(!(epidemiological_week <= latest_week))%>%
filter(key %in% v2$key & after_zero1 %in% c("Revised_NA", NA, "Zero1"))%>%
arrange(adm_0_name, year, epidemiological_week)%>%
mutate(dengue_total_cumulative3 = as.numeric(latest_count))%>%
mutate(after_zero2 = "Zero2")
zero2 <- zero1 %>%
mutate(key2 = paste0(adm_0_name, "_", year, "_", epidemiological_week))%>%
anti_join(x, by = "key2")
zero2 <- bind_rows(x, zero2)
zero2$after_zero2[is.na(zero2$after_zero2)] <- as.character(zero2$after_zero1[is.na(zero2$after_zero2)])
summary(zero2$dengue_total_cumulative3) # NA=7648 (decreased from 8679)
plyr::count(zero2$after_zero2)
# Find duplicated cumulative counts where NAs exist in between, and replace NAs with the duplicated count values.
# create a lookup table for duplicated non-NA cumulative count values.
dup <- zero2 %>%
group_by(adm_0_name, year)%>%
mutate(dup = as.numeric(duplicated(dengue_total_cumulative3) | duplicated(dengue_total_cumulative3, fromLast = TRUE)))%>%
filter(!is.na(dengue_total_cumulative3))%>%
filter(dup==1)%>%
select(adm_0_name, year, epidemiological_week, dengue_total_cumulative3)
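The `duplicated(x) | duplicated(x, fromLast = TRUE)` idiom above flags every occurrence of a repeated value, not only the second and later ones; a quick illustration:

```r
x <- c(10, 12, NA, 12, 13, 13)
duplicated(x) | duplicated(x, fromLast = TRUE)
# FALSE  TRUE FALSE  TRUE  TRUE  TRUE
```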
# mark the epidemiological weeks in which each duplicated count starts and ends. For example, if the cumulative count values for weeks 10 to 16 are 10, 12, NA, NA, 12, 13, 13, then for the value 12, min_week = 11 and max_week = 14.
dup2 <- dup %>%
mutate(key = paste0(adm_0_name, "_", year))%>%
group_by(key, dengue_total_cumulative3)%>%
summarise(min_week = min(epidemiological_week),
max_week = max(epidemiological_week))
# convert wide to long format and complete sequence column and fill in rows.
dup3 <- dup2 %>%
gather(., var, epidemiological_week, min_week:max_week)%>%
arrange(key, epidemiological_week)%>%
group_by(key, dengue_total_cumulative3)%>%
complete(epidemiological_week = first(epidemiological_week):max(epidemiological_week)) %>% select(-var)%>%
rename(dup_count = dengue_total_cumulative3)
# if there are duplicated cumulative counts where any NA values exist in between, those NAs will be replaced with cum count values
zero3 <- zero2 %>%
mutate(key = paste0(adm_0_name, "_", year))%>%
merge(., dup3, by=c("key", "epidemiological_week"), all.x=T, all.y=F)%>%
mutate(dengue_total_cumulative4 = ifelse(is.na(dengue_total_cumulative3) & is.na(dup_count)==FALSE,
as.numeric(dup_count), dengue_total_cumulative3))%>%
# mutate(source_cat = ifelse(is.na(dengue_total_cumulative2) & is.na(dup_count)==FALSE, "Zero", source_cat))%>%
mutate(after_zero3 =ifelse(is.na(dengue_total_cumulative3) & is.na(dup_count)==FALSE, "Zero3", after_zero2))
summary(zero3$dengue_total_cumulative4) #NA 7040 (decreased from 7648)
Create an epidemiological week 0 row (with a cumulative count of 0) for each country and year, to permit imputation for weeks early in the year that have no reported cumulative count.
imp <- zero3 %>%
as_tibble() %>%
group_by(adm_0_name) %>%
group_modify(~ add_row(.x,year=2014:2022,
epidemiological_week = 0,
dengue_total_cumulative4 = 0 ,.before = 0))%>%
arrange(adm_0_name, year, epidemiological_week)
imp <- imp %>%
group_by(adm_0_name, year)%>%
mutate(dengue_na_spline = as.integer(na.spline(dengue_total_cumulative4, na.rm=FALSE, maxgap=2, method="hyman")))%>%
ungroup()
# The maxgap=2 argument specifies that the function only fills runs of up to two consecutive missing values. method="hyman" performs monotone Hermite spline interpolation, which ensures that the interpolated cumulative counts never decrease between observed values.
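A toy illustration of the maxgap behaviour (made-up monotone counts; requires the zoo package):

```r
library(zoo)

y <- c(0, 10, NA, 30, NA, NA, NA, 80)
# the single NA (gap of 1) is interpolated; the run of three NAs
# exceeds maxgap = 2, so those positions remain NA
na.spline(y, maxgap = 2, method = "hyman")
```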
# Extract the number of consecutive imputed values for each year and each country
consec <- data.frame(unclass(rle(is.na(imp$dengue_total_cumulative4))))
# The 'lengths' column indicates the number of consecutive rows and the 'values' column indicates whether cumulative counts are NAs (values==TRUE) or not.
# Highlight those year & epidemiological weeks where there are >= 6 imputed values consecutively, so that these can be discarded moving forward (as we have less confidence in imputed values over periods of longer than 6 weeks)
consec <- consec %>%
mutate(values2 = ifelse(lengths < 6 & values == TRUE, FALSE, values))
consec <- as.data.frame(lapply(consec, rep, consec$lengths))
imp <- cbind(imp, consec)
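The `rep()` call above expands the run-length table back to one row per original observation, so it can be cbind-ed onto imp; a small example:

```r
runs <- data.frame(unclass(rle(c(TRUE, TRUE, FALSE))))
runs
#   lengths values
# 1       2   TRUE
# 2       1  FALSE
as.data.frame(lapply(runs, rep, runs$lengths))
#   lengths values
# 1       2   TRUE
# 2       2   TRUE
# 3       1  FALSE
```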
imp <- imp %>%
# for gaps of 6 weeks or more, replace with NAs
mutate(dengue_na_spline2 = ifelse(values2==TRUE, NA, dengue_na_spline)) %>%
filter(!epidemiological_week==0)%>% #remove epidemiological week = 0
select(adm_0_name, adm_0_code, year, epidemiological_week,
calendar_start_date, calendar_end_date,
cum_start_date, cum_end_date, cum_interval,
dengue_total_cumulative,
dengue_total_cumulative2, # after revised down
dengue_total_cumulative3, # after zero 1 and 2
dengue_total_cumulative4, # after zero 3
dengue_na_spline2, # after imputation
result, source_id, original, after_revision, after_zero1, after_zero2, after_zero3)%>%
mutate(calendar_start_date = ymd(calendar_start_date),
calendar_end_date = ymd(calendar_end_date),
cum_start_date = ymd(cum_start_date),
cum_end_date = ymd(cum_end_date))
#For data gaps of >= 6 weeks, we will need to base the calendar start date on the previous calendar start date for which we have data. This will mean that these periods present incident counts but over a longer time period, according to the size of the gap.
# As of now, we have imputed cumulative count values ("dengue_na_spline2") that have not been missed for longer than six weeks.
imp <- imp %>%
mutate(after_imp = ifelse((after_zero3 %in% c(NA, "Revised_NA")) & is.na(dengue_na_spline2)==FALSE, "Imputed", after_zero3))
plyr::count(imp$after_imp)
## x freq
## 1 Imputed 229
## 2 original 15462
## 3 Revised_NA 151
## 4 Zero1 287
## 5 Zero2 1031
## 6 Zero3 608
## 7 <NA> 6672
Prepare the dataset for converting cumulative to incident counts. This is most straightforward where we have consecutive epidemiological weeks, as a lagged column of the previous week's cumulative count can be created and subtracted from the current week's count.
#create the incident count column by subtracting the previous week's cumulative count from the current week's cumulative count
inc_dt <- imp %>%
arrange(adm_0_name, year, epidemiological_week)%>%
group_by(adm_0_name, year)%>%
# mutate(dengue_total_lag = lag(dengue_total_cumulative4, default=0))%>%
# mutate(dengue_total_inc = dengue_total_cumulative4-dengue_total_lag)%>%
mutate(dengue_na_spline_lag = lag(dengue_na_spline2, default = 0)) %>%
mutate(dengue_na_spline_inc = dengue_na_spline2-dengue_na_spline_lag) %>% ungroup()
The following example shows incident dengue cases in Argentina in 2017, before and after zero-filling and imputation.
Let’s see how the data coverage has been updated so far. The figure below shows the weekly data coverage (2014-2022) by source (raw, imputed, revised, not available) and country. Imputation filled in gaps, whereas revision may have created more. NA refers to cases where the original raw data were not available; Revised_NA refers to cases where the original raw data were available but were replaced with NAs because they were revised down over time.
So in summary, what changes have been made to the original raw datasets?
## # A tibble: 1 × 3
## original n prop
## <chr> <int> <dbl>
## 1 NA 8768 35.9
## # A tibble: 1 × 3
## after_revision n prop
## <chr> <int> <dbl>
## 1 NA 8978 36.7
## # A tibble: 1 × 3
## after_zero3 n prop
## <chr> <int> <dbl>
## 1 NA 7052 28.9
## # A tibble: 1 × 3
## after_imp n prop
## <chr> <int> <dbl>
## 1 NA 6823 27.9